ABSTRACT

Current research on audio source separation provides tools to
estimate the signals contributed by different instruments in
polyphonic music mixtures. Such tools can already be
incorporated into music production and post-production workflows.
In this paper, we describe recent experiments in which audio
source separation is applied to remixing and upmixing existing
mono and stereo music content.


1. AUDIO SOURCE SEPARATION USING DEEP NEURAL NETWORKS


Audio source separation has come a long way in recent years,
moving towards algorithms that exploit prior information to
estimate time-frequency masks [1]. For example, Deep Neural
Networks (DNNs) are used in a supervised setting that strongly
depends on available training data. In exchange, supervised
training frees them from assumptions needed by other algorithms,
such as having recordings from multiple microphones or dealing
with repetitive music structures. DNNs are trained to estimate
time-frequency masks, which still rely on the assumption that
energy from different sound sources does not overlap in the
time-frequency plane. Applying hard (binary) masks to
spectrograms achieves good separation but introduces many
noticeable artifacts. Soft masks produce better-sounding
results, but imperfect separation. Results from soft masks can
still be recombined in remixing and upmixing applications.
In this paper we describe two recent prototypes that allow
repurposing of musical audio using popular instrument classes.
While perceptual evaluation is still pending, both can be used
to provide convincing results.
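
The difference between hard and soft masking can be illustrated with a
minimal sketch. The magnitudes, mask values and the 0.5 binarization
threshold below are hypothetical and chosen only for illustration; in
practice the soft mask would come from a trained DNN. Plain Python lists
are used to keep the sketch free of dependencies.

```python
# Minimal sketch of soft vs. hard time-frequency masking.
# All values are hypothetical; spectrograms are nested lists
# indexed as [frequency_bin][time_frame].

def binarize(soft_mask, threshold=0.5):
    """Turn a soft mask (values in [0, 1]) into a hard (binary) mask."""
    return [[1.0 if m >= threshold else 0.0 for m in row] for row in soft_mask]

def complement(mask):
    """Mask for everything that is not the target source."""
    return [[1.0 - m for m in row] for row in mask]

def apply_mask(magnitudes, mask):
    """Scale each time-frequency bin of the mixture by its mask value."""
    return [[x * m for x, m in zip(mag_row, mask_row)]
            for mag_row, mask_row in zip(magnitudes, mask)]

# Hypothetical mixture magnitudes and soft mask for one source.
mixture = [[1.0, 2.0, 4.0],
           [0.5, 3.0, 2.0]]
soft_mask = [[0.9, 0.4, 0.5],
             [0.1, 0.75, 0.25]]

# Soft masking: every bin is shared between the source and the residual.
soft_estimate = apply_mask(mixture, soft_mask)
soft_residual = apply_mask(mixture, complement(soft_mask))

# Hard masking: every bin is assigned entirely to one side.
hard_estimate = apply_mask(mixture, binarize(soft_mask))
```

Because the soft estimate and its residual add up to the original mixture
bin by bin, energy that leaks into the wrong estimate is not lost and is
restored when the estimates are summed again, which is what remixing and
upmixing do.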


2. REPURPOSING MUSICAL AUDIO 


The general idea is to use time-frequency masks estimated
by DNN models [2] to upmix and remix musical audio.
This means that we are able to make audio content interactive
by providing the user with controls for remixing or upmixing,
not unlike using an intelligent equalizer that knows about the
instrument sounds in the mixture. Our prototypes use models
trained using the dataset from the SiSEC MUS challenge [3],
where sources have been consistently annotated according to
common popular music instrument categories (vocals, bass,
drums, other).

Figure 1: Block diagram of the remixing system.

Figure 1 shows a diagram of our remixing
prototype, presented at the 2nd Web Audio Conference [4].
In this case, one DNN was trained for each instrument. The
predictions of each model are used as probability estimates
of the corresponding instrument in each time-frequency bin.
One slider controls a global minimum threshold for the
estimates. The user can then rescale the magnitude of each
bin with instrument-specific sliders. While total soloing or
muting of specific instruments cannot be achieved without
artifacts, it is possible to obtain good-quality remixes for some
parameter settings.

Figure 2: Block diagram of the upmixing system.

Figure 2 shows a diagram of our upmixing prototype, which was
demonstrated at the 2nd AES Sound Field Control conference.
The demo was delivered on a 22.2 system [5]. Here, a single
DNN model was trained to predict soft masks for each
instrument. The resulting channels are sent to an object-based
sound spatialization engine using Vector Base Amplitude
Panning (VBAP), under development by the S3A project [6],
and the user interface allows locating different instruments
in 3D space. In informal listening tests, no artifacts were
perceived during typical use.
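
The remixing controls described above (a global minimum threshold plus
one slider per instrument) can be sketched as follows. The text does not
specify how the thresholded estimates and slider values are combined, so
the rule used here is an assumption made for illustration: each bin
starts at unity gain and every instrument whose estimate passes the
threshold shifts that gain in proportion to its estimate. The function
names, slider values and spectrogram values are all hypothetical.

```python
# Sketch of slider-driven remixing. The combination rule is an
# assumption for illustration, not the rule used in the prototype.
# Spectrograms are nested lists indexed as [frequency_bin][time_frame].

def remix_gains(estimates, slider_gains, threshold):
    """Compute one gain per time-frequency bin.

    estimates:    dict instrument -> per-bin probability estimates in [0, 1]
    slider_gains: dict instrument -> linear gain set by the user (1.0 = unchanged)
    threshold:    global minimum; estimates below it are ignored
    """
    names = list(estimates)
    n_bins = len(estimates[names[0]])
    n_frames = len(estimates[names[0]][0])
    gains = [[1.0] * n_frames for _ in range(n_bins)]
    for name in names:
        for f in range(n_bins):
            for t in range(n_frames):
                p = estimates[name][f][t]
                if p >= threshold:
                    # Assumed rule: shift the bin gain towards the slider
                    # value in proportion to the estimate.
                    gains[f][t] += (slider_gains[name] - 1.0) * p
    # Keep gains non-negative in case several estimates overlap strongly.
    return [[max(0.0, g) for g in row] for row in gains]

def remix(mixture_magnitudes, estimates, slider_gains, threshold=0.2):
    """Rescale the mixture magnitudes bin by bin."""
    gains = remix_gains(estimates, slider_gains, threshold)
    return [[x * g for x, g in zip(mag_row, gain_row)]
            for mag_row, gain_row in zip(mixture_magnitudes, gains)]

# Hypothetical example with 2 frequency bins and 2 time frames.
mixture = [[1.0, 2.0],
           [4.0, 2.0]]
estimates = {
    "vocals": [[0.5, 0.1], [0.0, 1.0]],
    "drums":  [[0.5, 0.75], [1.0, 0.0]],
}
sliders = {"vocals": 2.0, "drums": 0.5}  # boost vocals, attenuate drums

remixed = remix(mixture, estimates, sliders, threshold=0.2)
unchanged = remix(mixture, estimates, {"vocals": 1.0, "drums": 1.0})
```

With all sliders at 1.0 this rule leaves the mixture untouched, and
raising the threshold restricts the sliders to bins where the models are
more confident. Restricting the slider range in the interface would be
one way to avoid the extreme settings that lead to artifacts.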

3. CONCLUSIONS 

We have described two prototypes for repurposing musical 
audio. Both systems rely on supervised source separation 
models trained with a specific set of instrument categories, 
and thus only work for music with similar instrumentation. 
Although the quality of separation depends on the difficulty 
of the mixture, it is possible to constrain the user interface to 
produce good results by avoiding extreme configurations. 